rvest quick start guide
R Packages
Lots to choose from: XML, XML2R, scrapeR, selectr, rjson, RSelenium, etc.
Many more (and links to the above) on the Web Technologies CRAN Task View
But, we’ll be using the tidyverse packages rvest and xml2
tidyverse?
“The tidyverse is a set of packages that work in harmony…. The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command.” - RStudio Blog
You may already have used:
ggplot2 for visualization
dplyr for data manipulation
tidyr for data tidying
Install all tidyverse packages in one fell swoop with install.packages("tidyverse").
tidyverse packages
httr: for web APIs (Application Programming Interfaces)
jsonlite: for JSON (JavaScript Object Notation) data from the web
xml2: for XML (eXtensible Markup Language) structured data
rvest: package of wrapper functions around xml2 and httr for easy web scraping
We’ll focus on rvest
rvest:
What data do you want?
Find it on the web!
R
“Huh? What am I doing?” - some of you right now
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n <img height="1" widt ...
Need to find your data within the myhtml object.
Tags to look for:
<p>: paragraphs
<h1>, <h2>, etc.: headers
<a>: links
<li>: item in a list
<table>: tables
Use Selector Gadget to find the exact location. (Demo)
For more on HTML, I recommend W3schools’ tutorial.
You don’t need to be an expert in HTML to web scrape with rvest!
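To make the tag list concrete, here is a minimal sketch that parses a small HTML snippet and pulls out elements by tag. The `toy_page` object and its contents are invented for illustration and are not taken from any real site; `read_html()` accepts literal HTML in a string, so no network access is needed.

```r
library(rvest)

# A tiny, invented page held in a string (an assumption for illustration).
toy_page <- read_html('<html><body>
  <h1>Moonlight</h1>
  <p>A short plot summary.</p>
  <p>See the <a href="/cast">full cast</a>.</p>
  <ul><li>Drama</li><li>2016</li></ul>
</body></html>')

# Each call selects every element with the given tag, then keeps its text.
headers    <- toy_page %>% html_nodes("h1") %>% html_text()
paragraphs <- toy_page %>% html_nodes("p")  %>% html_text()
list_items <- toy_page %>% html_nodes("li") %>% html_text()
links      <- toy_page %>% html_nodes("a")  %>% html_text()

headers     # "Moonlight"
list_items  # "Drama" "2016"
```

On a real page the same tag often appears many times, which is why a more specific selector from Selector Gadget is usually needed.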
rvest where to find your data
Copy-paste from Selector Gadget or pass HTML tags to html_nodes() to extract your data of interest
## [1] "\n A young African-American man grapples with his identity and sexuality while experiencing the everyday struggles of childhood, adolescence, and burgeoning adulthood.\n "
## [[1]]
## Cast overview, first billed only: Cast overview, first billed only:
## 1 NA Mahershala Ali
## 2 NA Shariff Earp
## 3 NA Duan Sanderson
## 4 NA Alex R. Hibbert
## 5 NA Janelle Monáe
## 6 NA Naomie Harris
## 7 NA Jaden Piner
## 8 NA Herman 'Caheei McGloun
## 9 NA Kamal Ani-Bellow
## 10 NA Keomi Givens
## 11 NA Eddie Blanchard
## 12 NA Rudi Goblen
## 13 NA Ashton Sanders
## 14 NA Edson Jean
## 15 NA Patrick Decile
## Cast overview, first billed only:
## 1 ...
## 2 ...
## 3 ...
## 4 ...
## 5 ...
## 6 ...
## 7 ...
## 8 ...
## 9 ...
## 10 ...
## 11 ...
## 12 ...
## 13 ...
## 14 ...
## 15 ...
## Cast overview, first billed only:
## 1 Juan
## 2 Terrence
## 3 Azu \n \n \n (as Duan 'Sandy' Sanderson)
## 4 Little \n \n \n (as Alex Hibbert)
## 5 Teresa
## 6 Paula
## 7 Kevin age 9
## 8 Longshoreman \n \n \n (as Herman 'Caheej' McGloun)
## 9 Portable Boy 1
## 10 Portable Boy 2
## 11 Portable Boy 3
## 12 Gee \n \n \n (as Rudi Goblin)
## 13 Chiron
## 14 Mr. Pierce
## 15 Terrel
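library(rvest)
library(dplyr)     # mutate()
library(stringr)   # str_replace_all()
library(magrittr)  # extract2()
mydat <- myhtml %>%
  html_nodes("table") %>%
  extract2(1) %>%               # first table on the page
  html_table(header = TRUE)
mydat <- mydat[, c(2, 4)]       # keep the actor and role columns
names(mydat) <- c("Actor", "Role")
mydat <- mydat %>%
  mutate(Role = str_replace_all(Role, "\n ", ""))  # strip line breaks left by the page layout
mydat
## Actor Role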
## 1 Mahershala Ali Juan
## 2 Shariff Earp Terrence
## 3 Duan Sanderson Azu (as Duan 'Sandy' Sanderson)
## 4 Alex R. Hibbert Little (as Alex Hibbert)
## 5 Janelle Monáe Teresa
## 6 Naomie Harris Paula
## 7 Jaden Piner Kevin age 9
## 8 Herman 'Caheei McGloun Longshoreman (as Herman 'Caheej' McGloun)
## 9 Kamal Ani-Bellow Portable Boy 1
## 10 Keomi Givens Portable Boy 2
## 11 Eddie Blanchard Portable Boy 3
## 12 Rudi Goblen Gee (as Rudi Goblin)
## 13 Ashton Sanders Chiron
## 14 Edson Jean Mr. Pierce
## 15 Patrick Decile Terrel
Using rvest, scrape a table from Wikipedia. You can pick your own table or you can get one of the tables in the country GDP per capita example from earlier.
Your result should be a data frame with one observation per row and one variable per column.
library(rvest)
library(dplyr)     # mutate()
library(readr)     # parse_number()
library(magrittr)  # extract2()
myurl <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita"
myhtml <- read_html(myurl)
myhtml %>%
  html_nodes("table") %>%
  extract2(3) %>%                             # third table on the page
  html_table(header = TRUE, fill = TRUE) %>%
  mutate(`Int$` = parse_number(`Int$`)) %>%   # "138,910" -> 138910
  head()
## Rank Country/Territory Int$
## 1 1 Qatar 138910
## 2 — Macau 113352
## 3 2 Luxembourg 112045
## 4 3 Singapore 105689
## 5 4 Ireland 86988
## 6 5 Brunei 85011
rvest
html_nodes
html_nodes(x, "path") extracts all elements from the page x that have the tag / class / id path. (Use SelectorGadget to determine path.)
html_node() does the same thing but only returns the first matching element.
myhtml %>%
  html_nodes("p") %>%   # first get all the paragraphs
  html_nodes("a")       # then get all the links in those paragraphs
## {xml_nodeset (39)}
## [1] <a href="/wiki/Gross_domestic_product" title="Gross domestic product">gr ...
## [2] <a href="/wiki/Purchasing_power_parity" title="Purchasing power parity"> ...
## [3] <a href="/wiki/Goods_and_services" title="Goods and services">goods and ...
## [4] <a href="/wiki/International_dollar" title="International dollar">Int$</a>
## [5] <a href="#cite_note-world-2019-3">[n 1]</a>
## [6] <a href="/wiki/List_of_countries_by_wealth_per_adult" title="List of cou ...
## [7] <a href="/wiki/Gross_domestic_product" title="Gross domestic product">gr ...
## [8] <a href="/wiki/Per_capita" title="Per capita">per capita</a>
## [9] <a href="/wiki/IMF" class="mw-redirect" title="IMF">IMF</a>
## [10] <a href="/wiki/World_Bank" title="World Bank">World Bank</a>
## [11] <a href="/wiki/Savings" class="mw-redirect" title="Savings">savings</a>
## [12] <a href="/wiki/Cost_of_living" title="Cost of living">cost of living</a>
## [13] <a href="/wiki/List_of_countries_by_GDP_(nominal)_per_capita" title="Lis ...
## [14] <a href="https://en.wiktionary.org/wiki/generalized" class="extiw" title ...
## [15] <a href="/wiki/Living_standards" class="mw-redirect" title="Living stand ...
## [16] <a href="/wiki/Inflation_rates" class="mw-redirect" title="Inflation rat ...
## [17] <a href="/wiki/Exchange_rates" class="mw-redirect" title="Exchange rates ...
## [18] <a href="#cite_note-4">[3]</a>
## [19] <a href="#cite_note-5">[4]</a>
## [20] <a href="/wiki/Personal_income" title="Personal income">personal income</a>
## ...
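The difference between selecting by tag, class, and id, and between html_nodes() and html_node(), can be sketched on a small page. The snippet below is invented for illustration (the class name `summary` and the id `credit` are assumptions, not selectors from any real site).

```r
library(rvest)

# Invented snippet: two paragraphs share a class, one paragraph has an id.
toy_page <- read_html('<html><body>
  <p class="summary">First summary.</p>
  <p class="summary">Second summary.</p>
  <p id="credit">Written by someone.</p>
</body></html>')

all_p      <- toy_page %>% html_nodes("p")         # by tag: every <p>
summaries  <- toy_page %>% html_nodes(".summary")  # by class: leading dot
credit     <- toy_page %>% html_nodes("#credit")   # by id: leading hash
first_only <- toy_page %>% html_node(".summary")   # first match only

length(all_p)          # 3
length(summaries)      # 2
html_text(first_only)  # "First summary."
```

html_nodes() always returns a nodeset, even for a single match, while html_node() returns one node, which matters when the result is passed on to functions such as html_table().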
html_text
html_text(x) extracts all text from the nodeset x
myhtml %>%
  html_nodes("p") %>%   # first get all the paragraphs
  html_nodes("a") %>%   # then get all the links in those paragraphs
  html_text()           # get the linked text only
## [1] "gross domestic product"
## [2] "purchasing power parity"
## [3] "goods and services"
## [4] "Int$"
## [5] "[n 1]"
## [6] "list of countries by wealth per adult"
## [7] "gross domestic product"
## [8] "per capita"
## [9] "IMF"
## [10] "World Bank"
## [11] "savings"
## [12] "cost of living"
## [13] "List of countries by GDP (nominal) per capita"
## [14] "generalized"
## [15] "living standards"
## [16] "inflation rates"
## [17] "exchange rates"
## [18] "[3]"
## [19] "[4]"
## [20] "personal income"
## [21] "Standard of living and GDP"
## [22] "international dollars"
## [23] "rounded"
## [24] "whole number"
## [25] "economies"
## [26] "sovereign states"
## [27] "dependent territories"
## [28] "tax havens"
## [29] "corporate tax havens"
## [30] "[9]"
## [31] "tax haven lists"
## [32] "[10]"
## [33] "[11]"
## [34] "leprechaun economics"
## [35] "BEPS"
## [36] "modified gross national income"
## [37] "corporate tax havens"
## [38] "major global tax havens"
## [39] "GDP-per-capita tax haven proxy"
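Scraped text often carries the page's own line breaks and indentation, as the plot summary earlier did. A minimal sketch of one way to handle that is the `trim` argument of html_text(); the page below is invented for illustration.

```r
library(rvest)

# Invented paragraph padded with whitespace, as page layouts often produce.
toy_page <- read_html('<html><body>
  <p>
     A young man grapples with his identity.
  </p>
</body></html>')

raw_text   <- toy_page %>% html_nodes("p") %>% html_text()
clean_text <- toy_page %>% html_nodes("p") %>% html_text(trim = TRUE)

clean_text  # "A young man grapples with his identity."
```

Trimming only removes leading and trailing whitespace; line breaks in the middle of the text still need something like str_replace_all().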
html_table
html_table(x, header, fill) - parse HTML table(s) from x into a data frame or list of data frames
## {xml_nodeset (2)}
## [1] <table width="100%"><tbody><tr>\n<td valign="top"> <div class="legend" st ...
## [2] <table style="font-size:100%;"><tbody>\n<tr>\n<td width="30%" align="cent ...
myhtml %>%
  html_nodes("table") %>%       # get the tables
  extract2(3) %>%               # pick the third one to parse
  html_table(header = TRUE)     # parse table
## Rank Country/Territory Int$
## 1 1 Qatar 138,910
## 2 — Macau 113,352
## 3 2 Luxembourg 112,045
## 4 3 Singapore 105,689
## 5 4 Ireland 86,988
## 6 5 Brunei 85,011
## 7 6 Norway 79,638
## 8 7 United Arab Emirates 70,441
## 9 8 Kuwait 67,891
## 10 9 Switzerland 67,558
## 11 10 United States 67,426
## 12 — Hong Kong 66,527
## 13 11 San Marino 62,913
## 14 12 Netherlands 60,299
## 15 — Taiwan 57,214
## 16 13 Iceland 56,974
## 17 14 Saudi Arabia 56,912
## 18 15 Sweden 55,989
## 19 16 Denmark 55,675
## 20 17 Germany 55,306
## 21 18 Austria 55,171
## 22 19 Australia 54,799
## 23 20 Canada 52,144
## 24 21 Bahrain 51,991
## 25 22 Belgium 50,904
## 26 23 Malta 49,589
## 27 24 Finland 49,548
## 28 25 France 48,640
## 29 26 Oman 48,593
## 30 27 United Kingdom 48,169
## 31 28 Japan 46,827
## 32 29 Korea, South 46,452
## 33 30 Spain 43,007
## 34 31 Cyprus 42,956
## 35 32 New Zealand 42,045
## 36 33 Italy 41,582
## 37 — Puerto Rico 41,198
## 38 34 Czech Republic 40,585
## 39 35 Slovenia 40,344
## 40 36 Israel 40,337
## 41 37 Lithuania 38,751
## 42 38 Slovakia 38,321
## 43 39 Estonia 37,606
## 44 40 Hungary 35,941
## 45 41 Poland 35,651
## 46 42 Portugal 34,936
## 47 43 Malaysia 34,567
## 48 44 Trinidad and Tobago 33,713
## 49 45 Bahamas, The 33,432
## 50 46 Seychelles 33,118
## 51 47 Latvia 32,987
## 52 48 Saint Kitts and Nevis 31,950
## 53 49 Greece 31,616
## 54 50 Russia 30,820
## 55 51 Antigua and Barbuda 30,593
## 56 52 Kazakhstan 30,178
## 57 53 Romania 29,555
## 58 54 Turkey 29,327
## 59 55 Croatia 29,207
## 60 56 Panama 28,456
## 61 57 Chile 27,150
## 62 58 Mauritius 26,461
## 63 59 Bulgaria 26,034
## 64 60 Maldives 24,796
## 65 61 Uruguay 24,516
## 66 62 Montenegro 21,977
## 67 63 Turkmenistan 21,855
## 68 64 Mexico 21,363
## 69 65 Thailand 21,361
## 70 66 Belarus 21,224
## 71 67 People's Republic of China 20,984
## 72 68 Dominican Republic 20,625
## 73 69 Argentina 19,971
## 74 70 Equatorial Guinea 19,961
## 75 71 Gabon 19,839
## 76 72 Serbia 19,767
## 77 73 Botswana 19,388
## 78 74 Barbados 19,364
## 79 75 Azerbaijan 19,156
## 80 76 Iraq 18,755
## 81 77 Costa Rica 18,651
## 82 — World[n 2] 18,391
## 83 78 Iran 17,832
## 84 79 Grenada 17,434
## 85 80 North Macedonia 17,378
## 86 81 Guyana 17,163
## 87 82 Brazil 17,106
## 88 83 Palau 16,855
## 89 84 Colombia 16,265
## 90 85 Algeria 16,091
## 91 86 Suriname 16,044
## 92 87 Lebanon 15,599
## 93 88 Peru 15,399
## 94 89 Saint Lucia 15,159
## 95 90 Mongolia 15,089
## 96 91 Bosnia and Herzegovina 14,894
## 97 92 Albania 14,866
## 98 93 Indonesia 14,841
## 99 94 Egypt 14,800
## 100 95 Sri Lanka 14,509
## 101 96 South Africa 13,965
## 102 97 Paraguay 13,213
## 103 98 Georgia 13,200
## 104 99 Tunisia 13,093
## 105 — Kosovo 13,017
## 106 100 Saint Vincent and the Grenadines 12,983
## 107 101 Dominica 12,851
## 108 102 Fiji 12,689
## 109 103 Ecuador 11,866
## 110 104 Armenia 11,845
## 111 105 Namibia 11,451
## 112 106 Eswatini 11,139
## 113 107 Bhutan 10,627
## 114 108 Ukraine 10,130
## 115 109 Philippines 10,094
## 116 110 Jordan 9,939
## 117 111 Jamaica 9,932
## 118 112 Morocco 9,667
## 119 113 Uzbekistan 9,595
## 120 114 Libya 9,446
## 121 115 Nauru 9,073
## 122 116 India 9,027
## 123 117 Guatemala 9,009
## 124 118 Belize 8,791
## 125 119 Cape Verde 8,716
## 126 120 Laos 8,684
## 127 121 Vietnam 8,677
## 128 122 El Salvador 8,593
## 129 123 Bolivia 8,525
## 130 124 Moldova 8,161
## 131 125 Congo, Republic of the 7,336
## 132 126 Ghana 7,343
## 133 127 Myanmar 7,220
## 134 128 Tonga 6,867
## 135 129 Angola 6,763
## 136 130 Samoa 6,493
## 137 131 Nigeria 6,172
## 138 132 Pakistan 6,016
## 139 133 Djibouti 5,855
## 140 134 Honduras 5,600
## 141 135 Bangladesh 5,453
## 142 136 Timor-Leste 5,321
## 143 137 Nicaragua 5,297
## 144 138 Mauritania 5,158
## 145 139 Cambodia 5,004
## 146 140 Côte d'Ivoire 4,754
## 147 141 Tuvalu 4,535
## 148 142 Kyrgyzstan 4,193
## 149 143 Zambia 4,174
## 150 144 Cameroon 4,099
## 151 145 Papua New Guinea 4,081
## 152 146 Senegal 4,079
## 153 147 Kenya 4,078
## 154 148 Sudan 3,986
## 155 149 Marshall Islands 3,972
## 156 150 Tajikistan 3,751
## 157 151 Micronesia, Federated States of 3,657
## 158 152 Lesotho 3,655
## 159 153 Tanzania 3,652
## 160 154 Benin 3,648
## 161 155 Nepal 3,550
## 162 156 São Tomé and Príncipe 3,499
## 163 157 Vanuatu 3,039
## 164 158 Comoros 2,898
## 165 159 Gambia, The 2,892
## 166 160 Zimbabwe 2,778
## 167 161 Uganda 2,753
## 168 162 Ethiopia 2,702
## 169 163 Rwanda 2,642
## 170 164 Chad 2,603
## 171 165 Guinea 2,574
## 172 166 Mali 2,569
## 173 167 Solomon Islands 2,363
## 174 168 Yemen 2,312
## 175 169 Kiribati 2,193
## 176 170 Afghanistan 2,182
## 177 171 Burkina Faso 2,181
## 178 172 Guinea-Bissau 2,113
## 179 173 Haiti 1,916
## 180 174 Togo 1,913
## 181 175 Madagascar 1,776
## 182 176 Sierra Leone 1,765
## 183 177 South Sudan 1,715
## 184 178 Liberia 1,428
## 185 179 Mozambique 1,372
## 186 180 Malawi 1,292
## 187 181 Niger 1,152
## 188 182 Eritrea 1,103
## 189 183 Congo, Democratic Republic of the 873
## 190 184 Central African Republic 864
## 191 185 Burundi 724
## 192 — Syria n/a
## 193 — Venezuela n/a
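In the output above the Int$ column is still text, because of the thousands separators and the "n/a" entries. The sketch below shows one way to convert such a column, using base R's gsub() and as.numeric() instead of the parse_number() call used earlier. The three-row table is invented for illustration and only mimics the layout of the real one.

```r
library(rvest)

# Invented three-row table with the same column layout as the output above.
toy_page <- read_html('<html><body><table>
  <tr><th>Rank</th><th>Country</th><th>Int$</th></tr>
  <tr><td>1</td><td>Qatar</td><td>138,910</td></tr>
  <tr><td>2</td><td>Luxembourg</td><td>112,045</td></tr>
  <tr><td>3</td><td>Syria</td><td>n/a</td></tr>
</table></body></html>')

gdp <- toy_page %>%
  html_node("table") %>%       # a single node, so html_table() returns one table
  html_table(header = TRUE)

# Strip the thousands separators; "n/a" cannot be parsed and becomes NA.
gdp$dollars <- suppressWarnings(as.numeric(gsub(",", "", gdp[["Int$"]])))
gdp$dollars  # 138910 112045 NA
```

Keeping the original text column next to the converted one makes it easy to check which rows turned into NA.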
html_attrs
html_attrs(x) - extracts all attributes from each element of a nodeset x
html_attr(x, name) - extracts the name attribute from all elements in nodeset x
Common attributes: href, title, class, style, etc.
## style
## "font-size:100%;"
## [1] "/wiki/Gross_domestic_product"
## [2] "/wiki/Purchasing_power_parity"
## [3] "/wiki/Goods_and_services"
## [4] "/wiki/International_dollar"
## [5] "#cite_note-world-2019-3"
## [6] "/wiki/List_of_countries_by_wealth_per_adult"
## [7] "/wiki/Gross_domestic_product"
## [8] "/wiki/Per_capita"
## [9] "/wiki/IMF"
## [10] "/wiki/World_Bank"
## [11] "/wiki/Savings"
## [12] "/wiki/Cost_of_living"
## [13] "/wiki/List_of_countries_by_GDP_(nominal)_per_capita"
## [14] "https://en.wiktionary.org/wiki/generalized"
## [15] "/wiki/Living_standards"
## [16] "/wiki/Inflation_rates"
## [17] "/wiki/Exchange_rates"
## [18] "#cite_note-4"
## [19] "#cite_note-5"
## [20] "/wiki/Personal_income"
## [21] "/wiki/Gross_domestic_product#Standard_of_living_and_GDP:_Wealth_distribution_and_externalities"
## [22] "/wiki/International_dollar"
## [23] "/wiki/Rounding"
## [24] "/wiki/Integer"
## [25] "/wiki/Economy"
## [26] "/wiki/Sovereign_state"
## [27] "/wiki/Dependent_territories"
## [28] "/wiki/Tax_havens"
## [29] "/wiki/Corporate_tax_haven"
## [30] "#cite_note-qqtz-11"
## [31] "/wiki/Tax_haven#Tax_haven_lists"
## [32] "#cite_note-dhar-12"
## [33] "#cite_note-imfx-13"
## [34] "/wiki/Leprechaun_economics"
## [35] "/wiki/BEPS"
## [36] "/wiki/Modified_gross_national_income"
## [37] "/wiki/Corporate_tax_haven"
## [38] "/wiki/Tax_haven#Top_10_tax_havens"
## [39] "/wiki/Corporate_haven#GDP-per-capita_tax_haven_proxy"
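The two attribute functions can be compared side by side on a small example. The links below are invented stand-ins shaped like the Wikipedia links above, not scraped content.

```r
library(rvest)

# Two invented links; only the first one has a class attribute.
toy_page <- read_html('<html><body><p>
  <a href="/wiki/IMF" title="IMF" class="mw-redirect">IMF</a>
  <a href="/wiki/World_Bank" title="World Bank">World Bank</a>
</p></body></html>')

links <- toy_page %>% html_nodes("a")

hrefs   <- html_attr(links, "href")   # one attribute, one value per link
classes <- html_attr(links, "class")  # NA where the attribute is absent
all_att <- html_attrs(links)          # list with a named vector per link

hrefs  # "/wiki/IMF" "/wiki/World_Bank"
```

html_attr() is usually the more convenient of the two for building a vector of URLs, since it returns a plain character vector of the same length as the nodeset.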
html_children - list the “children” of the HTML page. Can be chained like html_nodes
html_name - gives the tags of a nodeset. Use in a chain with html_children
## [1] "head" "body"
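A short sketch of chaining html_children() and html_name(), first at the top of a page and then one level down inside the body. The page is invented for illustration; its section names are assumptions chosen to resemble a Wikipedia year article.

```r
library(rvest)

# Invented page with two section headers and a list between them.
toy_page <- read_html('<html><head><title>Toy</title></head>
  <body><h2>Events</h2><ul><li>a</li></ul><h2>Deaths</h2></body></html>')

# Children of the page itself, then children of <body>.
top_level <- toy_page %>% html_children() %>% html_name()
body_tags <- toy_page %>% html_node("body") %>% html_children() %>% html_name()

top_level                # "head" "body"
body_tags                # "h2" "ul" "h2"
which(body_tags == "h2") # positions of the section headers
```

Knowing the positions of the headers makes it possible to slice out everything between two of them, which is a common way to isolate one section of a long page.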
html_form - parses HTML forms (checkboxes, fill-in-the-blanks, etc.)
html_session - simulates a session in an HTML browser; use the functions jump_to and back to navigate through the pages (rvest 1.0.0 and later rename these to session, session_jump_to, and session_back)
Find another website you want to scrape (ideas: all bills in the House so far this year, video game reviews, anything on Wikipedia) and use at least 3 different rvest functions in a chain to extract some data.
library(tidyverse)  # dplyr, tidyr, purrr, stringr
library(rvest)
library(magrittr)   # extract2()
url <- "http://avalon.law.yale.edu/subject_menus/inaug.asp"
# even though it's called "all inaugs" some are missing
all_inaugs <- url %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table(fill = TRUE, header = TRUE) %>%
  extract2(3)
# table of addresses
all_inaugs_tidy <- all_inaugs %>%
  gather(term, year, -President) %>%
  filter(!is.na(year)) %>%
  select(-term) %>%
  arrange(year)
head(all_inaugs_tidy)
## President year
## 1 George Washington 1789
## 2 George Washington 1793
## 3 John Adams 1797
## 4 Thomas Jefferson 1801
## 5 Thomas Jefferson 1805
## 6 James Madison 1809
# get the links to the addresses
inaugadds_adds <- (url %>%
  read_html() %>%
  html_nodes("a") %>%
  html_attr("href"))[12:66]
# create the urls to scrape
urlstump <- "http://avalon.law.yale.edu/"
# fixed() matches "../" literally; as a regex, each "." would match any character
inaugurls <- paste0(urlstump, str_replace(inaugadds_adds, fixed("../"), ""))
all_inaugs_tidy$url <- inaugurls
head(all_inaugs_tidy)
## President year url
## 1 George Washington 1789 http://avalon.law.yale.edu/18th_century/wash1.asp
## 2 George Washington 1793 http://avalon.law.yale.edu/18th_century/wash2.asp
## 3 John Adams 1797 http://avalon.law.yale.edu/18th_century/adams.asp
## 4 Thomas Jefferson 1801 http://avalon.law.yale.edu/19th_century/jefinau1.asp
## 5 Thomas Jefferson 1805 http://avalon.law.yale.edu/19th_century/jefinau2.asp
## 6 James Madison 1809 http://avalon.law.yale.edu/19th_century/madison1.asp
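As an alternative to pasting a URL stump onto each link by hand, xml2 provides url_absolute(), which resolves relative links against the page they came from. This is a sketch of that swapped-in approach, not the method used above; the two relative links are written out by hand in the shape the page appears to use.

```r
library(xml2)

# The page the links were scraped from, and two relative links of the assumed shape.
base_url  <- "http://avalon.law.yale.edu/subject_menus/inaug.asp"
rel_links <- c("../18th_century/wash1.asp", "../19th_century/jefinau1.asp")

# Resolve each relative link against the base page.
abs_links <- url_absolute(rel_links, base_url)
abs_links
```

This avoids string surgery on "../" and keeps working if some links on a page are already absolute.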
get_inaugurations <- function(url) {
  # read the page once; return NA if the request fails
  page <- try(read_html(url), silent = TRUE)
  if (inherits(page, "try-error")) {
    return(NA)
  }
  # the address text is the text of all paragraphs on the page
  page %>%
    html_nodes("p") %>%
    html_text()
}
# takes about 30 secs to run
all_inaugs_text <- all_inaugs_tidy %>%
  mutate(address_text = map(url, get_inaugurations))
all_inaugs_text$address_text[[1]]
## [1] " Fellow-Citizens of the Senate and of the House of Representatives: "
## [2] "Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years--a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but overwhelm with despondence one who (inheriting inferior endowments from nature and unpracticed in the duties of civil administration) ought to be peculiarly conscious of his own deficiencies. In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected. All I dare hope is that if, in executing this task, I have been too much swayed by a grateful remembrance of former instances, or by an affectionate sensibility to this transcendent proof of the confidence of my fellow-citizens, and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me, my error will be palliated by the motives which mislead me, and its consequences be judged by my country with some share of the partiality in which they originated. "
## [3] "Such being the impressions under which I have, in obedience to the public summons, repaired to the present station, it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe, who presides in the councils of nations, and whose providential aids can supply every human defect, that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes, and may enable every instrument employed in its administration to execute with success the functions allotted to his charge. In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own, nor those of my fellow- citizens at large less than either. No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States. Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency; and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude, along with an humble anticipation of the future blessings which the past seem to presage. These reflections, arising out of the present crisis, have forced themselves too strongly on my mind to be suppressed. You will join with me, I trust, in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence. "
## [4] "By the article establishing the executive department it is made the duty of the President \"to recommend to your consideration such measures as he shall judge necessary and expedient.\" The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given. It will be more consistent with those circumstances, and far more congenial with the feelings which actuate me, to substitute, in place of a recommendation of particular measures, the tribute that is due to the talents, the rectitude, and the patriotism which adorn the characters selected to devise and adopt them. In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments, no separate views nor party animosities, will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests, so, on another, that the foundation of our national policy will be laid in the pure and immutable principles of private morality, and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world. 
I dwell on this prospect with every satisfaction which an ardent love for my country can inspire, since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness; between duty and advantage; between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity; since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained; and since the preservation of the sacred fire of liberty and the destiny of the republican model of government are justly considered, perhaps, as deeply, as finally, staked on the experiment entrusted to the hands of the American people. "
## [5] "Besides the ordinary objects submitted to your care, it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system, or by the degree of inquietude which has given birth to them. Instead of undertaking particular recommendations on this subject, in which I could be guided by no lights derived from official opportunities, I shall again give way to my entire confidence in your discernment and pursuit of the public good; for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government, or which ought to await the future lessons of experience, a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted. "
## [6] "To the foregoing observations I have one to add, which will be most properly addressed to the House of Representatives. It concerns myself, and will therefore be as brief as possible. When I was first honored with a call into the service of my country, then on the eve of an arduous struggle for its liberties, the light in which I contemplated my duty required that I should renounce every pecuniary compensation. From this resolution I have in no instance departed; and being still under the impressions which produced it, I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department, and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require. "
## [7] "Having thus imparted to you my sentiments as they have been awakened by the occasion which brings us together, I shall take my present leave; but not without resorting once more to the benign Parent of the Human Race in humble supplication that, since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity, and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness, so His divine blessing may be equally conspicuous in the enlarged views, the temperate consultations, and the wise measures on which the success of this Government must depend. "
## [1] "Martin Van Buren" "James Buchanan" "James A. Garfield"
## [4] "Calvin Coolidge"
# there are 7 missing at this point: obama's and trump's, plus coolidge, garfield, buchanan, and van buren, which errored in the scraping.
obama09 <- get_inaugurations("http://avalon.law.yale.edu/21st_century/obama.asp")
obama13 <- readLines("speeches/obama2013.txt")
trump17 <- readLines("speeches/trumpinaug.txt")
vanburen1837 <- readLines("speeches/vanburen1837.txt") # row 13
buchanan1857 <- readLines("speeches/buchanan1857.txt") # row 18
garfield1881 <- readLines("speeches/garfield1881.txt") # row 24
coolidge1925 <- readLines("speeches/coolidge1925.txt") # row 35
all_inaugs_text$address_text[c(13, 18, 24, 35)] <- list(vanburen1837, buchanan1857, garfield1881, coolidge1925)
# let's combine them all now
recents <- data.frame(President = c(rep("Barack Obama", 2), "Donald Trump"),
                      year = c(2009, 2013, 2017),
                      url = NA,
                      address_text = NA)
all_inaugs_text <- rbind(all_inaugs_text, recents)
all_inaugs_text$address_text[56:58] <- list(obama09, obama13, trump17)
Now, I use the tidytext package to get the words out of each inaugural address.
# install.packages("tidytext")
library(tidytext)
presidential_words <- all_inaugs_text %>%
  select(-url) %>%
  unnest(address_text) %>%             # one row per paragraph
  unnest_tokens(word, address_text)    # one row per word
head(presidential_words)
## # A tibble: 6 x 3
## President year word
## <chr> <dbl> <chr>
## 1 George Washington 1789 fellow
## 2 George Washington 1789 citizens
## 3 George Washington 1789 of
## 4 George Washington 1789 the
## 5 George Washington 1789 senate
## 6 George Washington 1789 and
presidential_wordtotals <- presidential_words %>%
  group_by(President, year) %>%
  summarize(num_words = n()) %>%
  arrange(desc(num_words))
First, get all the URLs for the Wikipedia articles for the years 1987-2016.
Next, create a data frame to store all of the data.
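The two steps above could be sketched as follows. The URL pattern is an assumption inferred from the /wiki/1987-style links that appear in the page itself; `celebDeaths` and its `url` column are the names the later code in this walkthrough expects.

```r
# Assumed layout: each year's article lives at https://en.wikipedia.org/wiki/<year>.
years <- 1987:2016

# One row per year, with the URL of that year's article.
celebDeaths <- data.frame(
  year = years,
  url  = paste0("https://en.wikipedia.org/wiki/", years),
  stringsAsFactors = FALSE
)

head(celebDeaths, 3)
```

Keeping the URL as a column means a scraping function can later be mapped over it row by row.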
## {xml_nodeset (7)}
## [1] <h2 id="mw-toc-heading">Contents</h2>\n
## [2] <h2>\n<span class="mw-headline" id="Events">Events</span><span class="mw- ...
## [3] <h2>\n<span class="mw-headline" id="Births">Births</span><span class="mw- ...
## [4] <h2>\n<span class="mw-headline" id="Deaths">Deaths</span><span class="mw- ...
## [5] <h2>\n<span class="mw-headline" id="Nobel_Prizes">Nobel Prizes</span><spa ...
## [6] <h2>\n<span class="mw-headline" id="References">References</span><span cl ...
## [7] <h2>Navigation menu</h2>
## {xml_nodeset (1469)}
## [1] <li><a href="/wiki/19th_century" title="19th century">19th century</a></li>
## [2] <li><b><a href="/wiki/20th_century" title="20th century">20th century</a ...
## [3] <li><a href="/wiki/21st_century" title="21st century">21st century</a></li>
## [4] <li><a href="/wiki/1960s" title="1960s">1960s</a></li>
## [5] <li><a href="/wiki/1970s" title="1970s">1970s</a></li>
## [6] <li><b><a href="/wiki/1980s" title="1980s">1980s</a></b></li>
## [7] <li><a href="/wiki/1990s" title="1990s">1990s</a></li>
## [8] <li><a href="/wiki/2000s_(decade)" title="2000s (decade)">2000s</a></li>
## [9] <li><a href="/wiki/1984" title="1984">1984</a></li>
## [10] <li><a href="/wiki/1985" title="1985">1985</a></li>
## [11] <li><a href="/wiki/1986" title="1986">1986</a></li>
## [12] <li><b><a class="mw-selflink selflink">1987</a></b></li>
## [13] <li><a href="/wiki/1988" title="1988">1988</a></li>
## [14] <li><a href="/wiki/1989" title="1989">1989</a></li>
## [15] <li><a href="/wiki/1990" title="1990">1990</a></li>
## [16] <li><a href="/wiki/1987_in_archaeology" title="1987 in archaeology">Arch ...
## [17] <li><a href="/wiki/1987_in_architecture" title="1987 in architecture">Ar ...
## [18] <li><a href="/wiki/1987_in_art" title="1987 in art">Art</a></li>
## [19] <li><a href="/wiki/1987_in_aviation" title="1987 in aviation">Aviation</ ...
## [20] <li><a href="/wiki/Category:1987_awards" title="Category:1987 awards">Aw ...
## ...
get_deaths <- function(url) {
  # get the main content page
  page <- url %>%
    read_html() %>%
    html_nodes("#mw-content-text") %>%
    html_children() %>%
    html_children()
  # get the names of all elements
  tagnames <- page %>% html_name()
  # where are the big section headers
  h2s <- which(tagnames == "h2")
  # to find the heading labeled "Deaths"
  h2childids <- page[h2s] %>% html_children() %>% html_attr("id")
  idDeaths <- which(h2childids == "Deaths")
  # list of deaths starts after the location of deathStart and
  # ends immediately before the location of deathEnd (next big header);
  # (idDeaths + 1) / 2 assumes every h2 has exactly two child spans
  deathStart <- h2s[(idDeaths + 1) / 2]
  deathEnd <- h2s[(idDeaths + 1) / 2 + 1]
  # get the deaths
  death_elements <- page[(deathStart + 1):(deathEnd - 1)]
  deaths <- death_elements %>% html_nodes("li") %>% html_text()
  # there are two types of entries: only one death that day in that year (a)
  has_dash <- grepl("–", deaths)
  deathsa <- data.frame(death = deaths[has_dash], stringsAsFactors = FALSE)
  deathsa <- deathsa %>%
    separate(death, into = c("Date", "Person"), sep = " – ") %>%
    separate(Date, into = c("Month", "Day"), sep = " ") %>%
    separate(Person, into = c("Name", "Desc"), sep = ", ", extra = "merge")
  # or there were multiple deaths that day in that year (b)
  deathsb <- data.frame(death = deaths[!has_dash], stringsAsFactors = FALSE)
  # remove repeats
  deathsb <- data.frame(death = deathsb[grep("\n", deathsb$death), ], stringsAsFactors = FALSE)
  # tidy up the data
  deathsb <- deathsb %>%
    separate(death, into = c("Date", "Other"), sep = "\\n", extra = "merge") %>%
    separate(Other, into = paste0("Person", 1:6), sep = "\\n", fill = "right") %>%
    gather(Person, Desc, -Date) %>%
    select(Date, Desc) %>%
    filter(!is.na(Desc))
  deathsb <- deathsb %>%
    separate(Desc, into = c("Name", "Desc"), sep = ", ", extra = "merge") %>%
    separate(Date, into = c("Month", "Day"), sep = " ") %>%
    filter(!is.na(Desc))
  # combine the 2 sets
  deaths <- rbind(deathsa, deathsb)
  return(deaths)
}
# should take about 10 seconds
celebDeaths <- celebDeaths %>%
  mutate(Deaths = map(url, get_deaths)) %>%
  unnest(Deaths)
head(celebDeaths[, -2])
## # A tibble: 6 x 5
## year Month Day Name Desc
## <int> <chr> <chr> <chr> <chr>
## 1 1987 January 2 Jean de Gribaldy French road cyclist and directeur sporti…
## 2 1987 January 5 Herman Smith-Jo… Norwegian supercentenarian (b. 1875)
## 3 1987 January 9 Arthur Lake American actor (b. 1905)
## 4 1987 January 13 Turgut Demirağ Turkish film producer, director and scre…
## 5 1987 January 14 Douglas Sirk German-born film director (b. 1897)
## 6 1987 January 15 Ray Bolger American actor, singer, and dancer (b. 1…
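The chain of separate() calls inside get_deaths is easier to follow on a couple of entries in isolation. The sketch below rebuilds two list items by hand from rows shown in the output above, assuming the "Month Day – Name, description" shape; the dash is written as the escape \u2013 so the script does not depend on file encoding.

```r
library(tidyr)

# Two entries reassembled by hand in the assumed "Month Day – Name, description" shape.
toy <- data.frame(
  death = c("January 9 \u2013 Arthur Lake, American actor (b. 1905)",
            "January 14 \u2013 Douglas Sirk, German-born film director (b. 1897)"),
  stringsAsFactors = FALSE
)

tidy_toy <- toy %>%
  separate(death, into = c("Date", "Person"), sep = " \u2013 ") %>%  # date vs. person
  separate(Date, into = c("Month", "Day"), sep = " ") %>%            # month vs. day
  separate(Person, into = c("Name", "Desc"),
           sep = ", ", extra = "merge")                              # split at first comma only

tidy_toy
```

The `extra = "merge"` argument matters here: descriptions often contain commas of their own, and without it everything after the second comma would be dropped.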
celebDeaths %>%
  group_by(year) %>%
  summarise(num_deaths = n()) %>%
  arrange(desc(num_deaths)) %>%
  head(10)
## # A tibble: 10 x 2
## year num_deaths
## <int> <int>
## 1 2016 410
## 2 1989 362
## 3 2015 358
## 4 1992 313
## 5 2014 304
## 6 1988 294
## 7 1987 275
## 8 2013 258
## 9 1993 247
## 10 1996 238